RESEARCH PAPER 


S-ProvFlow. Storing and Exploring Lineage Data as a 
Service 


Alessandro Spinuso¹, Malcolm Atkinson² & Federica Magnoni³


¹Koninklijk Nederlands Meteorologisch Instituut, De Bilt, Utrecht 3731 GA, The Netherlands
²University of Edinburgh, Edinburgh, Edinburgh EH8 9AB, United Kingdom
³Istituto Nazionale Geofisica e Vulcanologia, Rome, Lazio 00143, Italy


Keywords: Provenance; Productivity; Workflows; Human-in-the-loop; Visualisation 


Citation: Spinuso, A., Atkinson, M., Magnoni, F.: S-ProvFlow. Storing and exploring lineage data as a service. Data Intelligence
4(2), 226-242 (2022). doi: 10.1162/dint_a_00128
Received: July 28, 2021; Revised: December 3, 2021; Accepted: February 4, 2022 


ABSTRACT 


We present a set of configurable Web services and interactive tools, S-ProvFlow, for managing and exploiting
records tracking data lineage during workflow runs. It facilitates detailed analysis of single executions. It
helps users manage complex tasks by exposing the relationships between data, people, equipment and
workflow runs intended to combine productively. Its logical model extends the PROV standard to precisely
record parallel data-streaming applications. Its metadata handling encourages users to capture the application
context by specifying how application attributes, often using standard vocabularies, should be added. These
metadata records immediately help productivity as the interactive tools support their use in selection and
bulk operations. Users rapidly appreciate the power of the encoded semantics as they reap the benefits. This
improves the quality of provenance for users and management, which in turn facilitates analysis of collections
of runs, enabling users to manage results and validate procedures. It fosters reuse of data and methods and
facilitates diagnostic investigations and optimisations. We present S-ProvFlow’s use by scientists, research
engineers and managers as part of the DARE hyper-platform as they create, validate and use their data-driven
scientific workflows.


1. INTRODUCTION


The provenance of workflow executions, with its wealth of metadata, needs to be stored, processed,
made comprehensible and productively used. The S-ProvFlow system supports acquisition and exploration
of lineage and provenance data from workflows. It includes a database, a Web service and two interactive
tools. Myers et al. [1] demonstrated that productive use motivates adoption and improves metadata quality.
This improves research objects, as their metadata has been refined through use, enabling accurate replication,
which is essential for CWFR (Canonical Workflow Frameworks for Research). Users must be actively engaged
through tools that improve their productivity. Much contemporary work enables researchers to author methods
abstractly. Those methods are optimally (re)mapped, retaining users’ specified semantics, as technology
improves. We focus on the reverse information
flow, from those evolving systems back to research developers and application experts. We satisfy similar
demands for simplicity, comprehensibility and stability. This delivers FAIR benefits to all users by facilitating
the access and use of the standardised and persistent provenance traces, and the data, workflows and
components accessible via those traces. Our interactive tools show this potential, which can be
exploited by many other tools, systems and workflow technologies. The metadata used is a mix of generic
widely agreed terms, such as system and software identities and geo-spatial references, with discipline and 
community specific terms representing their knowledge infrastructure in which their work is framed by 
general (often global) hard-won agreements ([2], page xv). As we illustrate, this is essential for grounding 
the information in terms usable by them, their peers and successors via their working practices and digital 
systems. There is a corresponding spectrum of persistent identification. The traces once judged as valuable 
by domain experts can be allocated standard PIDs. The objects and components they touch will be identified 
in standard ways or in ways established and sustained by a community. The refinement of the metadata 
leading to fully described products and experiments can be conducted incrementally, producing updated 
and refined digital objects. To achieve this, the interactions occurring between data and workflows at 
different states of maturity should be evaluated within a collaborative and evolving ecosystem, as we will 
show in Section 4.1. This prototypes a path that other CWFR elements will need to follow to gain wide 
adoption incrementally in the research communities that have substantial intellectual, technical and political 
investments already. 


Provenance standards are a lingua franca encoding information gathered from multiple layers of a 
computational workflow. Our comprehensive architecture encompasses generation, management and access 
to provenance data. In previous work [3, 4], we addressed how research developers tune the
generation of provenance by injecting metadata instructions in the workflow’s operators. This is achieved via
an Active provenance framework that allows customisation of fine-grained lineage, fostering interaction
between users and a workflow’s provenance mechanisms [1]. The framework has been demonstrated in the
context of a general purpose analysis library for data-streaming pipelines, dispel4py [5, 6]. Depending on
the requirements, users can instruct a dispel4py workflow to extract metadata according to a kernel of
agreed terms specified by initiatives within their domain, as we mention in Section 4, as well as experimental
ones. These will be injected into the lineage, alongside those general purpose attributes that characterise 
the provenance model. In this follow-up paper we present instead a lineage web service and tools that 
develop understanding and fluency by delivering immediate benefits. We show this being used for active 
monitoring and retrospectively for diverse purposes. 


S-ProvFlow exploits standard models [3, 4] to efficiently manage metadata, thereby fostering comprehension,
usability and interoperability. Parameters, intermediate outputs and final results are automatically annotated,
making the traces and results discoverable and actionable, thereby offering drill-down and FAIR access
to a workflow’s outcome and enactment history. The system is the provenance service of the DARE
platform [7, 8]. It was first released during the VERCE project. In this paper we introduce its web API and
the interactive tools. Thanks to the management of the customised provenance traces produced by the
execution of the workflow, the interactive tools deliver understanding and control of “live” processes and
encourage sharing and reuse of data and methods. We have demonstrated the practical value of provenance
when evaluating the basis for evidence and for managing large numbers of runs. S-ProvFlow has given us 
eight years of experience working with domain experts, research developers and the teams who support 
their use of data and sophisticated computation. We draw on this experience to identify vital requirements 
for CWFR. These are essential for quality and sustainability of these multi-disciplinary professional 
collaborations that yield decision support we all depend on. 


2. RELATED WORK 


Methods for querying and visualising provenance have been developed [9, 10, 11], addressing specific
scenarios, workflow systems, or more general mechanisms, such as ProvStore [12]. Their storage technologies
are similarly specialised. For example, to represent Directed Acyclic Graphs (DAGs), PBase used Neo4j for the
ProvONE model. It enables queries on workflows’ traces that were previously uploaded onto the system in
VisTrails XML format. The interrogations include lineage and execution queries focusing on the involvement
of processes within runs and on the relationships between data and processes. For our work we chose the 
well established document-store, MongoDB [13], to give priority to use cases that access the provenance 
information using data properties and process parameters. This enabled S-ProvFlow’s discovery functionalities 
to exploit the lineage produced via the Active framework [3], reflecting each user’s context and their 
metadata. We took on challenges in [14, 15] also addressed by [10], where provenance is annotated with 
rich metadata and configuration parameters that rely on flexible vocabularies. 


The effectiveness of interactive access to provenance data depends on the quality of its visualisation. Map
Orbiter [16], for instance, summarises the DAG, which can be expanded on demand. S-ProvFlow, instead, starts
with a partial visualisation of the lineage graph. It can be searched or browsed to show process outputs,
allowing users to expand and navigate the derivation graph interactively. Alternative techniques, such as the
Sankey diagram used by PROV-O-Viz [17], represent the magnitude of flows between activities. Others map the
provenance graph to radial diagrams [18]. This technique, used for the analysis of parallel I/O [19], is also
used by InProv [11], which pioneered its adoption for visualising provenance recorded by PASS
(Provenance-aware Storage System) [20, 21]. Recently this has been combined with computer graphics to
improve the visual efficiency [22]. It reduces visual clutter by bringing the most important nodes to the
front. ProvStore [12, 23] offers radial diagrams to represent provenance relationships. In S-ProvFlow users
compose queries to focus on specific metadata terms and values, customising views for single computations
or for multiple users and runs. We believe that platforms that aim at the publication and reproducibility
of research artifacts, such as Whole Tale [24], could benefit from the adoption of S-ProvFlow. This would
complement their capability to manage preconfigured computing environments, where users re-execute
the workflows, with services to monitor, review and better represent the results, for instance by performing
in S-ProvFlow key provenance queries that select and render portions of the outputs of the overall workflow.


3. S-PROVFLOW: ARCHITECTURE AND COMPONENTS 


S-ProvFlow has multiple components delivering a comprehensive provenance infrastructure. It provides
a Web API and tools for interactive exploration of lineage data by users (Figure 1). The underlying
provenance model S-PROV [3] builds on the PROV and ProvONE recommendations. This interoperable
representation adds elements to encode complex lineage patterns, including process delegation, distribution
and statefulness of the workflow operators. In the recent deployment for the DARE platform [25], the
components are organised as micro-services optimised by decoupling via message queues, delivering
resilience to failures and support for authentication infrastructures.


S-PROV: http://purl.org/s-prov-v1-dev

PROV: http://www.w3.org/TR/prov-dm/

ProvONE: https://purl.dataone.org/provone-v1-dev


Figure 1. Schematic architecture exploiting the S-ProvFlow system for lineage acquisition and exploration. The
graphical interface supports direct access to result data, while the API offers an export service to disseminate
provenance information, adopting interoperable formats, to general purpose databases, such as triple stores.


3.1 Lineage API 


To facilitate the exploitation of the lineage data, S-ProvFlow exposes a set of high-level interrogation 
methods. We present the use cases and relevant implementation details. 


Layered Workflow Activity: The execution of a workflow can be examined at different levels of detail,
from high-level views based on the classification and functional grouping of workflow elements down to
a single element with multiple processing instances. Clients can easily switch between views. We use the
recommended MongoDB data denormalisation to exploit its powerful aggregation framework. Every time
a process is invoked, it also generates a lineage document. Each document repeats the same detailed
metadata, e.g., the location of the execution, the characteristics of the software and the role of the process
as a component of a higher-level function, which may be composed of more operators. This goes alongside
dynamic metadata, such as execution time, data volumes, reference values and domain metadata as specified
for the run or process. When queried, the information is aggregated without joins, obtaining complete
processing and functional information. The dynamic data is processed and aggregated to deliver the level
of abstraction clients have selected. The more documents are aggregated with respect to a property of a
process, the coarser the information presented [4].
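The layered views described above can be illustrated with a minimal in-memory sketch. It assumes one denormalised document per invocation; the field names (`function`, `process`, `instance`, `bytes_out`, `seconds`) are hypothetical and are not taken from the S-PROV schema, and plain Python grouping stands in for the MongoDB aggregation framework that the service actually relies on.

```python
from collections import defaultdict

# Hypothetical denormalised lineage documents, one per process invocation.
# Field names are illustrative assumptions, not the actual S-PROV schema.
LINEAGE_DOCS = [
    {"function": "preprocess", "process": "detrend", "instance": "detrend-0",
     "bytes_out": 100, "seconds": 0.5},
    {"function": "preprocess", "process": "detrend", "instance": "detrend-1",
     "bytes_out": 150, "seconds": 0.5},
    {"function": "preprocess", "process": "filter", "instance": "filter-0",
     "bytes_out": 80, "seconds": 0.25},
    {"function": "analysis", "process": "peak_ground_motion", "instance": "pgm-0",
     "bytes_out": 20, "seconds": 1.0},
]


def aggregate(docs, level):
    """Group invocation documents by `level` and sum their dynamic metadata.

    No join is needed: every document already carries the static properties
    (function, process, instance) that can serve as a grouping key.
    """
    summary = defaultdict(lambda: {"invocations": 0, "bytes_out": 0, "seconds": 0.0})
    for doc in docs:
        group = summary[doc[level]]
        group["invocations"] += 1
        group["bytes_out"] += doc["bytes_out"]
        group["seconds"] += doc["seconds"]
    return dict(summary)


if __name__ == "__main__":
    # From the finest view (single instances) to the coarsest (functions).
    for level in ("instance", "process", "function"):
        print(level, aggregate(LINEAGE_DOCS, level))
```

Switching `level` is the only change needed to move between views, which is the practical benefit of repeating the static metadata in every document.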


MongoDB aggregation framework: https://docs.mongodb.com/v4.0/aggregation/


Search: The API enables the discovery of data and experiments by performing searches on metadata
terms, as well as on the semantic and functional abstractions characterising processes. Intuitive metadata
expressions are used to combine value-lists and value-ranges to search for data and workflow executions.
To accelerate metadata searches we use compound indexes on dictionaries of the form {key:<term>,
value:<val>}. This allows us to efficiently query dynamic vocabularies, without hitting the limit on the
number of indexes allowed.
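As a hedged illustration of this key/value layout, the sketch below builds MongoDB-style filters with the standard `$elemMatch`, `$in`, `$gte` and `$lte` operators and evaluates them in memory so that it runs without a database. The array field name `metadata`, the helper names and the sample documents are assumptions made for the example; they are not the S-ProvFlow API.

```python
# Hypothetical documents: metadata stored as a list of {key, value} pairs, so
# that a single compound index on (metadata.key, metadata.value) can serve
# every term of an open-ended vocabulary. The field name is an assumption.
DOCS = [
    {"id": "d1", "metadata": [{"key": "station", "value": "AQU"},
                              {"key": "magnitude", "value": 5.9}]},
    {"id": "d2", "metadata": [{"key": "station", "value": "TERO"},
                              {"key": "magnitude", "value": 4.1}]},
    {"id": "d3", "metadata": [{"key": "station", "value": "AQU"},
                              {"key": "magnitude", "value": 3.2}]},
]


def metadata_filter(term, values=None, low=None, high=None):
    """Build a MongoDB-style filter for one term: a value-list and/or a range."""
    condition = {}
    if values is not None:
        condition["$in"] = list(values)
    if low is not None:
        condition["$gte"] = low
    if high is not None:
        condition["$lte"] = high
    if not condition:
        raise ValueError("give a value list or at least one range bound")
    return {"metadata": {"$elemMatch": {"key": term, "value": condition}}}


def matches(doc, flt):
    """Evaluate the filter in memory (stand-in for the database engine)."""
    spec = flt["metadata"]["$elemMatch"]
    cond = spec["value"]
    for entry in doc.get("metadata", []):
        if entry["key"] != spec["key"]:
            continue
        value = entry["value"]
        if "$in" in cond and value not in cond["$in"]:
            continue
        if "$gte" in cond and not value >= cond["$gte"]:
            continue
        if "$lte" in cond and not value <= cond["$lte"]:
            continue
        return True
    return False


def search(docs, *filters):
    """Return ids of documents satisfying all the given term filters."""
    return [doc["id"] for doc in docs if all(matches(doc, f) for f in filters)]


if __name__ == "__main__":
    print(search(DOCS,
                 metadata_filter("station", values=["AQU", "TERO"]),
                 metadata_filter("magnitude", low=4.0, high=6.0)))
```

Because every term shares the same two indexed fields, adding a new vocabulary term requires no new index, which is the point made in the paragraph above.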


Data Lineage: The API allows users to navigate the data derivation graph interactively by specifying
how much depth should be retrieved at each step. Here too, the denormalised approach to storage,
combined with linked lists that represent the PROV wasDerivedFrom relationship between data entities in
different documents, makes it easy to obtain information about the processes while navigating through the data
derivations. Clients combine graph traversals with metadata filters to view data whose ancestors’ properties
match their requirements.
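A depth-limited traversal combined with an ancestor filter can be sketched as follows. The entity identifiers, the `derived_from` field and the metadata values are hypothetical; the example only mirrors the idea of following wasDerivedFrom links stored inside each document.

```python
from collections import deque

# Hypothetical entities; `derived_from` plays the role of the linked list of
# PROV wasDerivedFrom references stored inside each document.
ENTITIES = {
    "AQU_mean.json": {"derived_from": ["norm_obs", "norm_syn"], "metadata": {}},
    "norm_obs": {"derived_from": ["pgm_obs"], "metadata": {"channel": "EHR"}},
    "norm_syn": {"derived_from": ["pgm_syn"], "metadata": {"channel": "HXR"}},
    "pgm_obs": {"derived_from": ["trace_obs"], "metadata": {"channel": "EHR"}},
    "pgm_syn": {"derived_from": ["trace_syn"], "metadata": {"channel": "HXR"}},
    "trace_obs": {"derived_from": [], "metadata": {"source": "observed"}},
    "trace_syn": {"derived_from": [], "metadata": {"source": "synthetic"}},
}


def ancestors(entities, start, depth):
    """Return {entity_id: level} for ancestors within `depth` derivation steps."""
    found = {}
    queue = deque([(start, 0)])
    while queue:
        current, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested depth
        for parent in entities[current]["derived_from"]:
            if parent not in found:
                found[parent] = level + 1
                queue.append((parent, level + 1))
    return found


def has_ancestor_matching(entities, start, key, value, depth):
    """True if some ancestor within `depth` steps carries metadata key == value."""
    return any(entities[a]["metadata"].get(key) == value
               for a in ancestors(entities, start, depth))


if __name__ == "__main__":
    print(ancestors(ENTITIES, "AQU_mean.json", 1))
    print(has_ancestor_matching(ENTITIES, "AQU_mean.json", "source", "synthetic", 3))
```

Retrieving one level at a time keeps each response small, which suits an interface where the user expands the graph step by step.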


Aggregations: The API provides high-level summary methods to extract comprehensive information
about single runs or collections of runs. One method covers processing dynamics of a single run, showing
data transfers between processes at a configurable granularity. Another reveals collaborative dynamics across
runs, such as data reuse between workflows, infrastructures and users (see Section 4.1). A final method
extracts information on the metadata in the archive, summarising their role and occurrence for users
and workflows. This is used in interfaces to produce personalised recommendations on the terms that could
be used in the queries, as shown in Figure 2. All of these methods depend on the powerful aggregation and
map-reduce capabilities of the MongoDB technology.
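The metadata summary used for query hints might look like the following sketch, which counts how often each term occurs in a user's lineage and proposes the most frequent ones. The document layout and function names are assumptions for illustration, and in-memory counting replaces the database-side aggregation.

```python
from collections import Counter, defaultdict

# Hypothetical lineage documents annotated with the user who produced them.
DOCS = [
    {"user": "alice", "metadata": [{"key": "station", "value": "AQU"},
                                   {"key": "magnitude", "value": 5.9}]},
    {"user": "alice", "metadata": [{"key": "station", "value": "AQU"},
                                   {"key": "magnitude", "value": 4.1},
                                   {"key": "channel", "value": "EHR"}]},
    {"user": "alice", "metadata": [{"key": "station", "value": "TERO"}]},
    {"user": "bob", "metadata": [{"key": "channel", "value": "HXR"}]},
]


def summarise_terms(docs):
    """Count, per user, how often each metadata term occurs in the archive."""
    usage = defaultdict(Counter)
    for doc in docs:
        for entry in doc["metadata"]:
            usage[doc["user"]][entry["key"]] += 1
    return usage


def suggest_terms(usage, user, limit=3):
    """Propose the user's most frequent terms as hints for composing queries."""
    return [term for term, _count in usage[user].most_common(limit)]


if __name__ == "__main__":
    usage = summarise_terms(DOCS)
    print(suggest_terms(usage, "alice", limit=2))
```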


3.2 Interactive Tools 


Research developers and administrators may use the API for different purposes. For instance, while the
former may validate and analyse the lineage of their experimental results, the latter may monitor how
infrastructures and data are exploited by users and applications. S-ProvFlow provides tools tailored for both
use cases.


The Monitoring and Validation Visualiser (MVV) assists users in the fine-grained interpretation of the
provenance records in order to understand dependencies. It allows them to select and configure viewpoints
by specifiable searches over domain metadata, offering data previews and navigation of data dependency
graphs. Detailed run-time diagnostics differentiate between stateless and stateful processes, the latter
highlighting data retention by operators such as accumulators and mergers. The visual components of the
tool are depicted in Figure 2, showing the lineage of a particular workflow in seismology. Users search for
workflow executions and data elements by formulating queries using a simple syntax that facilitates metadata
searches over ranges or lists of values. Terms may refer to standard vocabularies or be introduced
experimentally to evaluate specific applications. Advanced filters on the search operate upon request,
reducing the results based on the metadata values of the ancestors in each data derivation tree. The search
results can be investigated interactively by browsing through the metadata describing both products and
processes. Products can be volatile, and thereby described only by their metadata. However, when the provenance
is associated with actual resources, the tool will show this by offering users preview and download pop-up
functions. Every search is assisted through hints that are updated via the incremental analysis of the whole
provenance archive by one of the API’s aggregation methods (Section 3.1).


Figure 2. Monitoring and Validation Visualiser: combined access to data products and metadata. This example
represents the interaction with the provenance information of a run of the Waveform pre-processing workflow
introduced in Section 4. The Runtime Monitor shows the list of active processes. It reports the quantity of data
produced, the working node, and system or application messages. In the Data Dependency graph the yellow circle
indicates that the provenance entity links to a concrete data resource. Runs and products can be searched by using
metadata expressions.


The Bulk Dependency Visualiser (BDV) offers broader perspectives on computational characteristics as 
well as collaborative interactions. It combines radial diagrams with configurable grouping and hierarchical 
edge bundle techniques [22]. It allows its users to dynamically adjust viewing and grouping controls to 
uncover aspects of the distribution of the processing. This is obtained thanks to the underlying S-PROV 
model. Its core provenance data model encompasses system-level information in the context of the particular
application and workflow’s operator semantics, providing an interoperable representation of the workflow’s 
resource-mapping to a target cluster. The BDV uses these capabilities of the model to aggregate data and 
accommodate views that provide insights on the workflow’s enactment at a configurable level of detail, as 
shown in Figure 3. Besides the exploration of single executions, the BDV can also produce overviews of
large experiments that involve more researchers who reuse and exchange data via different workflows, with
the progressive refinement of the metadata. This visualisation technique could be applied to evolving FAIR
Digital Objects (FDOs)
to accommodate views that put in the forefront experiments whose metadata has been incrementally refined 
by peers, to better characterise the methods and the results that contribute to a particular study. This last 
use case is discussed in Section 4.1 in the context of the particular seismological application. 
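The configurable grouping behind such radial views can be approximated by summing data transfers between vertices at the chosen level. In this sketch the node names follow the labels visible in Figure 3, while the invocation ids, process names and byte counts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical invocations with the process they instantiate and the compute
# node they ran on (node names follow the labels visible in Figure 3).
INVOCATIONS = {
    "sim-0": {"process": "simulate", "node": "drachenfels-060"},
    "sim-1": {"process": "simulate", "node": "drachenfels-061"},
    "store-0": {"process": "store", "node": "drachenfels-070"},
}

# Hypothetical data transfers: (source invocation, target invocation, bytes).
TRANSFERS = [
    ("sim-0", "store-0", 400),
    ("sim-1", "store-0", 600),
    ("sim-0", "sim-1", 50),
]


def bundle_edges(transfers, invocations, group_by=None):
    """Sum transferred bytes between vertices of the diagram.

    group_by=None keeps single invocations as vertices; "process" or "node"
    merges invocations into coarser vertices before summing.
    """
    edges = defaultdict(int)
    for source, target, size in transfers:
        if group_by is not None:
            source = invocations[source][group_by]
            target = invocations[target][group_by]
        edges[(source, target)] += size
    return dict(edges)


if __name__ == "__main__":
    for level in (None, "process", "node"):
        print(level, bundle_edges(TRANSFERS, INVOCATIONS, level))
```

The summed weight on each edge is what a colour-coded legend would then encode.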


[Figure 3 image: radial diagrams, panel (a) shown, with vertices grouped by compute node; visible node labels:
drachenfels-060, drachenfels-061, drachenfels-070.]


Figure 3. BDV single-workflow visual analytics. Fine-grained radial perspectives for an earthquake simulation
workflow. The diagrams indicate overall data transfer between (a) the workflow’s processes, as well as (b) their
single invocations in a streaming execution. The colour-coded legend describes the amount of data transferred. The
vertices are labelled with process and invocation ids, respectively, and are grouped by the computational node
on which they run in the HPC cluster (Drachenfels at Fraunhofer SCAI). Hovering over a node highlights its
incoming (red) and outgoing (green) streams.


4. TEST CASE: SEISMIC RAPID ASSESSMENT 


Computational seismology is presently facing the challenge of managing increasing amounts of recorded
and simulated data, downloaded from rich data archives or produced by accurate and computationally
sophisticated tools. To analyse and exploit these, users need to easily customise the available methods to
adroitly explore new opportunities and to meet urgent challenges. Robust provenance-driven tools are
needed to smartly organise storage of the data and related metadata, and to encourage their exploration,
combination and reuse. This promotes reproducibility and error detection in scientific experiments, and
relates well to the general efforts of international organisations handling the wealth of Earth science data
(e.g., http://www.orfeus-eu.org/data/eida; https://www.fdsn.org/; http://ds.iris.edu/ds/).
These needs become critically urgent after large seismic events, since reliable and immediate outcomes are
fundamental to guide emergency response. In this context, the rapid assessment of seismic ground motion
(RA) is a key application in computational seismology, exposing requirements that embody all the
aforementioned needs. It also represents a good example to highlight the applicability of our framework to
different scientific contexts (e.g., [26]).


After a large earthquake it is essential to rapidly simulate the propagation of seismic waveforms in 
surrounding areas and quantitatively estimate specific ground motion parameters to assess the earthquake’s 
impact. Then, comparing and integrating synthetic information with recorded ground motion data improves
the understanding of ground response to the earthquake.


The RA theoretical foundations and applicable procedures are well established, and can be exemplified 
by the high-level steps shown in Figure 4 and detailed in [3, 27, 28]. Some steps are specific to this
seismological test case; others can be reused (with the necessary adaptations) in other fields (e.g., volcanology,
climate sciences; [26]). All the steps of this application need to be traceable and to have descriptive and
explorable metadata in order to support the research scientists in checking, reusing and sharing their 
methods and findings. 


[Figure 4 image: Rapid Ground Motion Assessment (RA) steps: choose/upload seismic source (point or fault); run
waveform simulation (MPI simulation); waveform pre-processing; Peak Ground Motion parameters; compare/integrate
synthetic and observed ground motion data. Other labels visible: metadata, provenance, difference.]


Figure 4. The Rapid Assessment (RA) method analysing the impact of an earthquake composes re-usable tasks 
that run at different scales and require human supervision and intervention. The image highlights RA data-analysis 
steps that exploit the benefits of provenance generation and exploitation. 


A fundamental step of the RA workflow is the Waveform pre-processing stage, which prepares seismological 
data by making simulated and recorded traces consistent and comparable. It is used in many other 
seismological applications (with possible variants in the sub-steps) such as inversion for seismic source 
parameters, seismic tomography and noise cross-correlation analyses (e.g. [29, 30, 31]), and similar 
processing steps are required for numerous geophysical, and in general scientific, applications (e.g., geodesy, 
climate sciences, etc.). The system presented in this work guarantees that the results have associated 
provenance information about all the preprocessing steps (with related parameterisations) that they went 
through (Figure 2), including data properties after each step. These enable the detection of errors in analyses, 
and the comparison of different ways of preparing the data and of their effects on the produced results such 
as the ground motion parameter estimates.


After pre-processing, the RA analysis requires the extraction of the ground motion parameters from
both the synthetic and recorded seismograms for subsequent comparison. In Figure 5 we present the PGM
(Peak Ground Motion) Parameters workflow for a single seismic station and channel. Our implementation
applies the combined analysis in parallel on both synthetic and observed data, keeping track of the
fine-grained steps they went through and of the acquired metadata. This allows researchers to trace the
processing back in case of an error, to combine and compare large amounts of data, and to discover the
results of intermediate steps.


[Figure 5 image: derivation graph ending in filename = AQU_mean.json. Two branches are visible, labelled
station = AQU, channel = EHR and station = AQU, channel = HXR; each shows the truncated process labels
streamPr, PeakGrou and NormPE, with port = output_mean.]


Figure 5. Lineage Precision: The image shows the lineage of a file AQU_mean.json (yellow circle) produced by
the WriteGeoJSON process as a result of the PGM workflow. The wasDerivedFrom relationships (arrows) show that
the file was correctly derived from the mean norm values of the observed and synthetic data channels of the same
seismic station AQU, EHR and HXR respectively. The light blue circle indicates a stateful derivation, revealing that
the Match operator had stored the input data produced by a PeakGroundMotion process in its internal state before
matching the data.
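The stateful derivation described in the caption can be illustrated with a simplified operator that retains the first value to arrive for a station until its counterpart shows up. This is only a sketch under stated assumptions: the class below is not the dispel4py operator used in the workflow, and the item fields (`id`, `station`, `kind`, `value`) and the lineage record layout are hypothetical.

```python
class Match:
    """Pairs the observed and synthetic values of the same station.

    The operator is stateful: the first value to arrive for a station is
    retained until its counterpart arrives, so the output derives from an
    input received during an earlier invocation.
    """

    def __init__(self):
        self.pending = {}   # station -> retained item
        self.lineage = []   # one record per emitted output

    def process(self, item):
        station = item["station"]
        waiting = self.pending.get(station)
        if waiting is None or waiting["kind"] == item["kind"]:
            # Nothing to pair with yet: keep (or refresh) the retained item.
            self.pending[station] = item
            return None
        del self.pending[station]
        output_id = f"match-{station}"
        self.lineage.append({
            "entity": output_id,
            "wasDerivedFrom": [waiting["id"], item["id"]],
            # Inputs that came from the internal state rather than from the
            # current invocation mark the derivation as stateful.
            "stateful_inputs": [waiting["id"]],
        })
        return {"id": output_id, "station": station,
                waiting["kind"]: waiting["value"], item["kind"]: item["value"]}


if __name__ == "__main__":
    match = Match()
    match.process({"id": "pgm-obs", "station": "AQU", "kind": "observed", "value": 1.0})
    print(match.process({"id": "pgm-syn", "station": "AQU", "kind": "synthetic", "value": 3.0}))
    print(match.lineage)
```

Recording which inputs were served from the operator's state is what lets a viewer distinguish stateful derivations from stateless ones.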


Drawing on a strong scientific background, the Waveform pre-processing and PGM Parameters workflows 
benefit significantly from S-ProvFlow, by having intermediate results properly managed and described 
by usable metadata. These are typical seismological metadata, which adhere to the recognized standards 
in the field and are also linkable to well-established infrastructures in the wider Earth science field 
(e.g., [32, 33], EIDA-ORFEUS, FDSN, IRIS).


Moreover, there are new, user-customized metadata learned from the workflow executions, providing a
database that is continuously enriched and kept up to date. Users can thus check the results of each step even
if outputs are not stored. The results and executed steps are traceable in the context of the generating process
and workflow. This fosters their retrieval to allow their comparison and reuse, and to diagnose, validate and
fine-tune new experimental analyses. All this makes our metadata and provenance management approach
deeply customizable for specific applications and, at the same time, easily adaptable to a multiplicity of
scientific workflows in diverse scientific fields, also for operational scenarios [28].


4.1 Collaborative Interactions and Metadata Refinements 


The workflow introduced above involves scenarios where collaborative interactions are established between
seismologists, by means of reusing data between different stages of the analysis. Figure 6 shows a radial
diagram that displays runs executed by two users. The right half of the diagram shows interlinked workflows
organised into separate sectors, according to their conceptual tasks. These were described by specifying
concepts and metadata to contextualise the methods involved. In contrast, the left side has a poor conceptual
characterisation, typical of the early phase of exploration, which results in chaotic provenance graphs that are
harder to analyse visually. The customisation of the view based on metadata allows users to visually
restrict the scope to particular data properties, suggesting, for instance, the reuse of the results of some runs
or, depending on the circumstance, the discard of those that show little contribution. The combination of
connections and colours makes all these different interactions evident. Interactively adjusting viewing
parameters with immediate feedback enables users to tune the display until they see what they are looking
for. Such visualisation suggests ways to help communities and research managers of computational
infrastructures obtain an immediate overview of the interactions across different sites. Especially in the
context of the CWFR framework, it also shows how the metadata used in a study have been extended and
refined over time, towards final and fully described experiments. These comprehensive diagrams are a
significant help when coordinating a long-running or extensive research campaign. They may be used for
public outreach and within official reports, providing a sensible and tangible perception of the actual
exploitation of a distributed data-intensive platform, serving yet another category of consumers through the
same underlying provenance model.


Figure 6. BDV Radial diagram highlighting data reuse between different workflows of the RA use case and
refinement of the associated metadata. The runs were selected by interactively querying the provenance archive to
find those using a common set of seismic stations. The runs were performed by two different users (blue and black
vertices). The right half presents interlinked workflows with a refined contextualisation of the lineage. It shows how
the improvements of the metadata applied over time by one of the users yield more informative content and better
visualisation.
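The data-reuse links that such a diagram draws can be derived, in principle, by matching the entities a run uses against those generated by other runs. The sketch below assumes hypothetical run summaries with `generated` and `used` lists; the run, user and workflow names are invented for illustration.

```python
from collections import Counter

# Hypothetical run summaries: which entities each run generated and used.
RUNS = [
    {"run": "sim-01", "user": "u1", "workflow": "simulation",
     "generated": ["syn-AQU"], "used": []},
    {"run": "pre-01", "user": "u1", "workflow": "preprocessing",
     "generated": ["pre-AQU"], "used": ["syn-AQU"]},
    {"run": "pgm-01", "user": "u2", "workflow": "pgm",
     "generated": ["pgm-AQU"], "used": ["pre-AQU"]},
    {"run": "pgm-02", "user": "u2", "workflow": "pgm",
     "generated": ["pgm-AQU-b"], "used": ["pre-AQU"]},
]


def reuse_links(runs, group_by="run"):
    """Count entities produced by one run and consumed by a different run.

    `group_by` selects the vertices of the resulting graph: "run", "user"
    or "workflow".
    """
    producer = {entity: run for run in runs for entity in run["generated"]}
    links = Counter()
    for consumer in runs:
        for entity in consumer["used"]:
            source = producer.get(entity)
            if source is not None and source["run"] != consumer["run"]:
                links[(source[group_by], consumer[group_by])] += 1
    return dict(links)


if __name__ == "__main__":
    print(reuse_links(RUNS, "user"))
    print(reuse_links(RUNS, "workflow"))
```

Grouping by user exposes who builds on whose results, while grouping by workflow shows how the stages of the analysis feed one another.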


5. CONCLUSIONS AND FUTURE WORK 


The incremental improvement of the relevance and usefulness of provenance increases confidence in the
possibilities for its exploitation, thereby promoting awareness of its importance. The provenance traces
themselves are candidate FDOs [34]. They also provide critical actionable information about the workflows,
e.g., to assess their validity, usefulness and efficiency, and about each enactment and its data products. All
of these may eventually be FDOs, but this depends on stimulating adoption of CWFR standards, which
will inevitably be an incremental process following the kinds of path we have pioneered. Through the active
participation of the experts, provenance is ready made for validation and results’ management use cases. We
explored technical solutions and standard models to be applied to the next generation of workflow management
systems (WMSs) [35], which have to encompass the challenges, identified by the FDO, concerning the evaluation
and traceability of experimental results. This depended on co-design and co-development with domain experts in
communities that had long-established global knowledge infrastructure with corresponding standards and agreed
practices. In this paper, we have focused on our work with computational seismologists. In a companion
paper [36] we report how provenance management empowers the reproducibility of interactive workspaces,
beyond workflows, with use cases which also include climate scientists. Edwards [2] makes clear the
extensive and complex knowledge infrastructure that has taken over a century to incrementally develop.
The standards and technologies we produce have to prove their value alongside those established systems
before they will begin to penetrate into their established working practices. We consider that improving the
quality and use of provenance by delivering immediate benefits to a broad range of users will encourage
adoption and engagement. Their improved productivity and the improved quality of decision support will
motivate the investment in sustainably collecting and preserving vital provenance records with sufficient
content. It will have long-term benefits for the quality of research procedures and the evidence they produce
that underpins life-critical decisions. We have adopted services and tools developed around our framework
to demonstrate its effectiveness. The developed interactive tools provide easy access, visualization and
navigation through the provenance information and metadata, also offering direct links to physical data
resources. Researchers exploit these tools to improve their science, by quickly detecting and solving
anomalies, and optimizing the combination of multiple runs and data for complex applications. Research
engineers and developers are helped to improve resource and data exploitation (Section 4.1). We
illustrated many of these capabilities in relation to the use cases of a real application in seismology. In future
work, we want to address and improve the preliminary results [25] of enabling the import of CWLProv into
S-ProvFlow, in order to scale the benefit of S-ProvFlow to a wider collection of WMSs.


Similar interdisciplinary co-development demonstrating immediate benefits will test and develop the other
aspects of the canonical workflow technologies and incentivise widespread experimental adoption, leading to
sustained growth in quality, capabilities and adoption. We therefore anticipate collaborations embedded in
many research contexts to improve the standard, including the provenance, developing the tools and work
environments it enables, to build momentum for its adoption. This will deliver more FDOs corresponding
to all aspects of the supported research and extend the use of FDO standards and representations deeper
into the established practices.